EDA Package Review
Introduction
This document evaluates the capabilities of a variety of R EDA packages. The goal is to select a subset to use consistently in future projects.
Data
We will use the data Robbie provided for the mortgage lead score prediction algorithm.
combined_data <- read_csv("00_data/combined-data.csv")
# Remove columns not needed
myData <- combined_data %>% dplyr::select(-c(contains("phone")))
myData <- myData %>% dplyr::select(-c("leadid", "eventually_apply", "application_dt", "funded_dt",
                                      "dialed_number", "start_date_time"))
glimpse(myData)
## Observations: 45,562
## Variables: 11
## $ lead_created_date <dttm> 2018-05-25 13:40:00, 2018-08-15 22:33:00, ...
## $ loan_purpose <chr> "REFI", "REFI", "REFI", "Purchase", "Purcha...
## $ self_reported_fico <dbl> 700, 640, 750, 720, 640, 750, 720, 700, 680...
## $ loan_to_value <dbl> 86.9, 91.1, 73.5, 84.6, 97.0, 97.5, 95.0, 9...
## $ loan_amount <dbl> 152000, 410000, 238902, 330000, 80510, 3900...
## $ property_state <chr> "FL", "CA", "NC", "CT", "GA", "IL", "CA", "...
## $ existing_customer <chr> "Yes", "Unknown", "Unknown", "Yes", "Unknow...
## $ time_to_purchase <chr> "N/A", "N/A", "N/A", "Unknown", "Unknown", ...
## $ eventually_fund <chr> "No", "No", "No", "No", "No", "No", "No", "...
## $ lead_source_code <chr> "Marketing-Email-Web", "Bankrate", "Creditq...
## $ time_lead_to_call <dbl> 345, 994, 1323, 932, 874, 307, 377, 1212, 3...
Quick Aside
The code below was built using the original raw data. Nice code - it took me a while to figure out how to split one long table into two. This might be helpful in the future. (It seems pretty simple now - perhaps I should be embarrassed.)
library(kableExtra)
library(dplyr)
tmpSources <- unique(myData$lead_source_code)
tmpSources <- as.data.frame(tmpSources)
# kable(tmpSources[1:15,])
# kable(tmpSources[1:floor(nrow(tmpSources)/2),])
kable1 <- tmpSources[1:floor(nrow(tmpSources)/2),]
kable2 <- tmpSources[floor(nrow(tmpSources)/2+1):nrow(tmpSources),]
# kable(list(kable1, kable2))
kable(list(kable1, kable2), col.names = c("Lead Source")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
EDA Package Review
The following R Packages are introduced:
- DataExplorer
- SmartEDA
- funModeling
- rpivotTable
- XDA
- skimr
- PerformanceAnalytics
- descriptr
- inspectdf
Also included:
- Miscellaneous EDA R code snippets
- synthpop - generate anonymized data from existing data while maintaining data characteristics
DataExplorer
The package intro can be found here. It is the only package (that I have found) that can create a complete EDA report with one function: create_report()!
plot_str creates an interactive plot displaying the structure of the data. It does not work interactively in R Markdown, so an image of the output is provided below.
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 45562 11 7 4 0
## # ... with 4 more variables: total_missing_values <int>,
## # complete_rows <int>, total_observations <int>, memory_usage <dbl>
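The one-row summary above comes from a single call; a minimal sketch, assuming the DataExplorer package is installed:

```r
library(DataExplorer)

introduce(myData)     # the one-row summary tibble shown above
plot_intro(myData)    # the same counts rendered as a percentage chart
```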
It is easy to customize the plots using ggthemes. There are many to choose from:
## [1] "theme_base" "theme_calc"
## [3] "theme_clean" "theme_economist"
## [5] "theme_economist_white" "theme_excel"
## [7] "theme_excel_new" "theme_few"
## [9] "theme_fivethirtyeight" "theme_foundation"
## [11] "theme_gdocs" "theme_hc"
## [13] "theme_igray" "theme_map"
## [15] "theme_pander" "theme_par"
## [17] "theme_solarized" "theme_solarized_2"
## [19] "theme_solid" "theme_stata"
## [21] "theme_tufte" "theme_wsj"
You can find examples here.
Here is the default plot:
And here is a customized plot:
## 3 columns ignored with more than 10 categories.
## lead_created_date: 41813 categories
## property_state: 52 categories
## lead_source_code: 33 categories
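The customized plot above can be produced by passing a ggthemes theme through the `ggtheme` argument of DataExplorer's plot functions; a sketch (the specific theme chosen here is an assumption):

```r
library(DataExplorer)
library(ggthemes)

# Any theme_*() from the list above can be supplied via ggtheme.
plot_bar(myData, ggtheme = theme_economist())
```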
DataExplorer can create an EDA report automatically - and it is pretty good! It is especially useful when you need results fast.
I encourage you to run the code below. (It emits many messages when building - messages I cannot silence - so the code chunk is not evaluated when knitr produces the HTML output for this document.)
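A sketch of such a report chunk, assuming the DataExplorer package (the `y` argument adds response-aware sections and is optional):

```r
library(DataExplorer)

# Writes a standalone HTML report; best run interactively rather than
# at knit time because it emits many build messages.
create_report(
  myData,
  y = "eventually_fund",
  output_file = "eda_report.html"
)
```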
Overall impression - good package, competitive with other EDA package options.
SmartEDA
## Descriptions Obs
## 1 Sample size (Nrow) 45562
## 2 No. of Variables (Ncol) 11
## 3 No. of Numeric Variables 4
## 4 No. of Factor Variables 0
## 5 No. of Text Variables 6
## 6 No. of Logical Variables 0
## 7 No. of Date Variables 2
## 8 No. of Zero variance Variables (Uniform) 0
## 9 %. of Variables having complete cases 100% (11)
## 10 %. of Variables having <50% missing cases 0% (0)
## 11 %. of Variables having >50% missing cases 0% (0)
## 12 %. of Variables having >90% missing cases 0% (0)
## S.no Variable Name Variable Type % of Missing
## 1 1 lead_created_date** POSIXct:POSIXt 0
## 2 2 loan_purpose* character 0
## 3 3 self_reported_fico numeric 0
## 4 4 loan_to_value numeric 0
## 5 5 loan_amount numeric 0
## 6 6 property_state* character 0
## 7 7 existing_customer* character 0
## 8 8 time_to_purchase* character 0
## 9 9 eventually_fund* character 0
## 10 10 lead_source_code* character 0
## 11 11 time_lead_to_call numeric 0
## No. of Unique values
## 1 41813
## 2 3
## 3 13
## 4 7392
## 5 4386
## 6 52
## 7 3
## 8 5
## 9 2
## 10 33
## 11 18078
ExpNumStat provides data overviews; by changing its parameters, different views are possible.
## Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value
## 3 loan_amount All 45562 0 8 45531 0 0 23
## 2 loan_to_value All 45562 0 17 45527 0 0 18
## 1 self_reported_fico All 45562 0 0 45541 0 0 21
## 4 time_lead_to_call All 45562 0 0 45562 0 0 0
## Per_of_Missing sum min max mean median SD
## 3 0.050 13910384132 0.0 9900000 305461 210499.0 382271.3
## 2 0.040 13163643 0.0 8000000 289 86.1 37733.1
## 1 0.046 32377621 579.0 761 711 720.0 45.4
## 4 0.000 123178882 3.3 350416 2704 582.4 12476.5
## CV IQR Skweness Kurtosis
## 3 1.251 230473.5 9.202 143.652
## 2 130.550 17.2 209.536 44364.599
## 1 0.064 70.0 -0.744 -0.786
## 4 4.615 1133.1 13.462 232.315
## Vname Group TN nNeg nZero nPos NegInf
## 3 loan_amount eventually_fund:No 44887 0 8 44856 0
## 7 loan_amount eventually_fund:Yes 675 0 0 675 0
## 2 loan_to_value eventually_fund:No 44887 0 17 44852 0
## 6 loan_to_value eventually_fund:Yes 675 0 0 675 0
## 1 self_reported_fico eventually_fund:No 44887 0 0 44866 0
## 5 self_reported_fico eventually_fund:Yes 675 0 0 675 0
## 4 time_lead_to_call eventually_fund:No 44887 0 0 44887 0
## 8 time_lead_to_call eventually_fund:Yes 675 0 0 675 0
## PosInf NA_Value Per_of_Missing sum min max mean
## 3 0 23 0.051 13666715361 0.00 9900000 304625.4
## 7 0 0 0.000 243668771 1.00 2748000 360990.8
## 2 0 18 0.040 13112552 0.00 8000000 292.2
## 6 0 0 0.000 51091 9.85 100 75.7
## 1 0 21 0.047 31881212 579.00 761 710.6
## 5 0 0 0.000 496409 620.00 761 735.4
## 4 0 0 0.000 121774657 3.30 350416 2712.9
## 8 0 0 0.000 1404225 5.00 132230 2080.3
## median SD CV IQR Skweness Kurtosis
## 3 210000.0 383765.9 1.260 232000.0 9.22 143.58
## 7 290000.0 258848.2 0.717 324375.0 2.19 11.49
## 2 86.7 38015.8 130.084 17.1 207.98 43707.01
## 6 80.0 16.0 0.212 16.9 -1.08 1.16
## 1 720.0 45.5 0.064 70.0 -0.73 -0.81
## 5 750.0 28.4 0.039 30.0 -1.91 3.64
## 4 590.4 12533.6 4.620 1136.2 13.44 231.16
## 8 385.0 7770.6 3.735 929.6 10.90 147.67
## Vname Group TN nNeg nZero nPos NegInf
## 3 loan_amount eventually_fund:All 45562 0 8 45531 0
## 7 loan_amount eventually_fund:No 44887 0 8 44856 0
## 11 loan_amount eventually_fund:Yes 675 0 0 675 0
## 2 loan_to_value eventually_fund:All 45562 0 17 45527 0
## 6 loan_to_value eventually_fund:No 44887 0 17 44852 0
## 10 loan_to_value eventually_fund:Yes 675 0 0 675 0
## 1 self_reported_fico eventually_fund:All 45562 0 0 45541 0
## 5 self_reported_fico eventually_fund:No 44887 0 0 44866 0
## 9 self_reported_fico eventually_fund:Yes 675 0 0 675 0
## 4 time_lead_to_call eventually_fund:All 45562 0 0 45562 0
## 8 time_lead_to_call eventually_fund:No 44887 0 0 44887 0
## 12 time_lead_to_call eventually_fund:Yes 675 0 0 675 0
## PosInf NA_Value Per_of_Missing sum min max mean
## 3 0 23 0.050 13910384132 0.00 9900000 305460.9
## 7 0 23 0.051 13666715361 0.00 9900000 304625.4
## 11 0 0 0.000 243668771 1.00 2748000 360990.8
## 2 0 18 0.040 13163643 0.00 8000000 289.0
## 6 0 18 0.040 13112552 0.00 8000000 292.2
## 10 0 0 0.000 51091 9.85 100 75.7
## 1 0 21 0.046 32377621 579.00 761 711.0
## 5 0 21 0.047 31881212 579.00 761 710.6
## 9 0 0 0.000 496409 620.00 761 735.4
## 4 0 0 0.000 123178882 3.30 350416 2703.5
## 8 0 0 0.000 121774657 3.30 350416 2712.9
## 12 0 0 0.000 1404225 5.00 132230 2080.3
## median SD CV IQR Skweness Kurtosis
## 3 210499.0 382271.3 1.251 230473.5 9.202 143.652
## 7 210000.0 383765.9 1.260 232000.0 9.225 143.581
## 11 290000.0 258848.2 0.717 324375.0 2.192 11.489
## 2 86.1 37733.1 130.550 17.2 209.536 44364.599
## 6 86.7 38015.8 130.084 17.1 207.978 43707.006
## 10 80.0 16.0 0.212 16.9 -1.084 1.159
## 1 720.0 45.4 0.064 70.0 -0.744 -0.786
## 5 720.0 45.5 0.064 70.0 -0.730 -0.810
## 9 750.0 28.4 0.039 30.0 -1.908 3.637
## 4 582.4 12476.5 4.615 1133.1 13.462 232.315
## 8 590.4 12533.6 4.620 1136.2 13.439 231.164
## 12 385.0 7770.6 3.735 929.6 10.904 147.666
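The three tables above correspond to ExpNumStat's three grouping modes; a sketch, assuming the SmartEDA package is installed:

```r
library(SmartEDA)

# by = "A": all rows pooled; by = "G": split by the grouping variable;
# by = "GA": both views combined in one table.
ExpNumStat(myData, by = "A",  MesofShape = 2, round = 2)
ExpNumStat(myData, by = "G",  gp = "eventually_fund", MesofShape = 2, round = 2)
ExpNumStat(myData, by = "GA", gp = "eventually_fund", MesofShape = 2, round = 2)
```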
Graphical representation of all numeric features - density plots (univariate).
Frequencies for all categorical independent variables:
## Variable Valid Frequency Percent CumPercent
## 1 loan_purpose Purchase 35680 78.31 78.3
## 2 loan_purpose REFI 9862 21.65 100.0
## 3 loan_purpose Unknown 20 0.04 100.0
## 4 loan_purpose TOTAL 45562 NA NA
## 5 existing_customer No 8335 18.29 18.3
## 6 existing_customer Unknown 13011 28.56 46.8
## 7 existing_customer Yes 24216 53.15 100.0
## 8 existing_customer TOTAL 45562 NA NA
## 9 time_to_purchase Immediately 6541 14.36 14.4
## 10 time_to_purchase Later 3353 7.36 21.7
## 11 time_to_purchase N/A 9862 21.65 43.4
## 12 time_to_purchase Soon 5567 12.22 55.6
## 13 time_to_purchase Unknown 20239 44.42 100.0
## 14 time_to_purchase TOTAL 45562 NA NA
## 15 eventually_fund No 44887 98.52 98.5
## 16 eventually_fund Yes 675 1.48 100.0
## 17 eventually_fund TOTAL 45562 NA NA
Bar plots for all categorical variables - simple but well-formatted plots.
When the target is a continuous variable, more plots are available. (Better plots would be displayed if the data were different.)
And there is more to the package that I have skipped.
But before we leave this one:
A parallel plot is the equivalent of a spider chart, but with Cartesian coordinates, so it is often preferred.
To understand parallel coordinate plots, look here.
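SmartEDA can draw one via ExpParcoord(); a sketch, assuming the SmartEDA package (the stratified sample sizes below are assumptions):

```r
library(SmartEDA)

# Parallel coordinate plot of the numeric features, colored by the target.
# Stsize draws a stratified sample per target level to keep the plot readable.
ExpParcoord(
  myData,
  Group  = "eventually_fund",
  Stsize = c(500, 200),
  Nvar   = c("self_reported_fico", "loan_to_value", "loan_amount", "time_lead_to_call")
)
```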
Overall impression - Useful
funModeling
This package provides more functionality than illustrated below - for example, gain and lift charts! For these reasons, I suggest funModeling be part of your daily toolset.
Note: below I occasionally use a different data set to illustrate the features more effectively.
## variable q_zeros p_zeros q_na p_na q_inf p_inf type
## 1 lead_created_date 0 0.00 0 0.00 0 0 POSIXct/POSIXt
## 2 loan_purpose 0 0.00 0 0.00 0 0 character
## 3 self_reported_fico 0 0.00 21 0.05 0 0 numeric
## 4 loan_to_value 17 0.04 18 0.04 0 0 numeric
## 5 loan_amount 8 0.02 23 0.05 0 0 numeric
## 6 property_state 0 0.00 40 0.09 0 0 character
## 7 existing_customer 0 0.00 0 0.00 0 0 character
## 8 time_to_purchase 0 0.00 0 0.00 0 0 character
## 9 eventually_fund 0 0.00 0 0.00 0 0 character
## 10 lead_source_code 0 0.00 0 0.00 0 0 character
## 11 time_lead_to_call 0 0.00 0 0.00 0 0 numeric
## unique
## 1 41813
## 2 3
## 3 12
## 4 7391
## 5 4385
## 6 51
## 7 3
## 8 5
## 9 2
## 10 33
## 11 18078
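The status table above comes from df_status(); a minimal sketch, assuming the funModeling package is installed:

```r
library(funModeling)
library(dplyr)

status <- df_status(myData)   # produces the status table shown above

# A common follow-up: list columns with any missing values, worst first.
status %>%
  filter(q_na > 0) %>%
  arrange(desc(p_na)) %>%
  select(variable, q_na, p_na)
```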
## loan_purpose frequency percentage cumulative_perc
## 1 Purchase 35680 78.31 78.3
## 2 REFI 9862 21.65 100.0
## 3 Unknown 20 0.04 100.0
## property_state frequency percentage cumulative_perc
## 1 CA 4860 10.68 10.7
## 2 NY 3971 8.72 19.4
## 3 TX 3820 8.39 27.8
## 4 FL 3741 8.22 36.0
## 5 NJ 1936 4.25 40.3
## 6 IL 1897 4.17 44.4
## 7 PA 1861 4.09 48.5
## 8 GA 1684 3.70 52.2
## 9 NC 1571 3.45 55.7
## 10 OH 1478 3.25 58.9
## 11 MI 1276 2.80 61.7
## 12 MA 1164 2.56 64.3
## 13 VA 1158 2.54 66.8
## 14 MD 991 2.18 69.0
## 15 MO 917 2.01 71.0
## 16 WA 883 1.94 73.0
## 17 TN 876 1.92 74.9
## 18 AZ 872 1.92 76.8
## 19 CO 869 1.91 78.7
## 20 IN 769 1.69 80.4
## 21 SC 724 1.59 82.0
## 22 MN 693 1.52 83.5
## 23 WI 572 1.26 84.8
## 24 LA 537 1.18 85.9
## 25 NV 492 1.08 87.0
## 26 OR 490 1.08 88.1
## 27 CT 474 1.04 89.1
## 28 AL 462 1.01 90.2
## 29 AR 457 1.00 91.2
## 30 OK 375 0.82 92.0
## 31 KS 330 0.72 92.7
## 32 KY 324 0.71 93.4
## 33 UT 296 0.65 94.0
## 34 MS 273 0.60 94.7
## 35 IA 256 0.56 95.2
## 36 NE 246 0.54 95.8
## 37 WV 241 0.53 96.3
## 38 NH 232 0.51 96.8
## 39 NM 192 0.42 97.2
## 40 ME 177 0.39 97.6
## 41 ID 169 0.37 98.0
## 42 DC 163 0.36 98.3
## 43 DE 162 0.36 98.7
## 44 MT 130 0.29 99.0
## 45 RI 111 0.24 99.2
## 46 AK 76 0.17 99.4
## 47 ND 74 0.16 99.5
## 48 SD 73 0.16 99.7
## 49 VT 71 0.16 99.9
## 50 WY 53 0.12 100.0
## 51 HI 3 0.01 100.0
## existing_customer frequency percentage cumulative_perc
## 1 Yes 24216 53.1 53.1
## 2 Unknown 13011 28.6 81.7
## 3 No 8335 18.3 100.0
## time_to_purchase frequency percentage cumulative_perc
## 1 Unknown 20239 44.42 44.4
## 2 N/A 9862 21.65 66.1
## 3 Immediately 6541 14.36 80.4
## 4 Soon 5567 12.22 92.7
## 5 Later 3353 7.36 100.0
## eventually_fund frequency percentage cumulative_perc
## 1 No 44887 98.52 98.5
## 2 Yes 675 1.48 100.0
## lead_source_code frequency percentage cumulative_perc
## 1 Web - Contact Us form 15753 34.57 34.6
## 2 Creditqual-Email-Web 8850 19.42 54.0
## 3 Bankrate 6585 14.45 68.4
## 4 Paid-Search-Web 5074 11.14 79.6
## 5 Marketing-Email-Web 3663 8.04 87.6
## 6 Zillow 2225 4.88 92.5
## 7 Web - Rate Table 738 1.62 94.1
## 8 Third-Party-Web 539 1.18 95.3
## 9 Web - Payment calculator 428 0.94 96.2
## 10 Informa-Web 403 0.88 97.1
## 11 Ally Website-Phone 289 0.63 97.8
## 12 Marketing Lead 280 0.61 98.4
## 13 Web - Affordability calculator 221 0.49 98.8
## 14 Web - Refi calculator 165 0.36 99.2
## 15 Quinstreet-Web 124 0.27 99.5
## 16 LendingTree-WarmTransfer-Phone 79 0.17 99.7
## 17 Mobile - Contact Us form 41 0.09 99.7
## 18 MLS-Listing-Web 31 0.07 99.8
## 19 LendingTree-RateTable 12 0.03 99.8
## 20 After-Hours-Call-Phone 9 0.02 99.9
## 21 Creditqual-Email-Phone 9 0.02 99.9
## 22 Affinity-Dealer-Phone 8 0.02 99.9
## 23 Third-Party-Phone 8 0.02 99.9
## 24 Affinity-Ally-Employee-Phone 6 0.01 99.9
## 25 Paid-Search-Phone 5 0.01 99.9
## 26 Informa-Phone 4 0.01 100.0
## 27 Mobile - Rate Table 4 0.01 100.0
## 28 Affinity-Ally-Employee-Web 2 0.00 100.0
## 29 Dealer-Referral 2 0.00 100.0
## 30 LendingTree-RateTable-Phone 2 0.00 100.0
## 31 Affinity-Vendor-Phone 1 0.00 100.0
## 32 MLS-Listing-Phone 1 0.00 100.0
## 33 Quinstreet 1 0.00 100.0
## [1] "Variables processed: loan_purpose, property_state, existing_customer, time_to_purchase, eventually_fund, lead_source_code"
## property_state frequency percentage cumulative_perc
## 1 CA 4860 10.67 10.7
## 2 NY 3971 8.72 19.4
## 3 TX 3820 8.38 27.8
## 4 FL 3741 8.21 36.0
## 5 NJ 1936 4.25 40.2
## 6 IL 1897 4.16 44.4
## 7 PA 1861 4.08 48.5
## 8 GA 1684 3.70 52.2
## 9 NC 1571 3.45 55.6
## 10 OH 1478 3.24 58.9
myData$property_state2 <- ifelse(myData$property_state %in% property_freq[1:25, 'property_state'],
                                 myData$property_state, 'other')
freq(myData, 'property_state2')
## property_state2 frequency percentage cumulative_perc
## 1 other 5950 13.06 13.1
## 2 CA 4860 10.67 23.7
## 3 NY 3971 8.72 32.5
## 4 TX 3820 8.38 40.8
## 5 FL 3741 8.21 49.0
## 6 NJ 1936 4.25 53.3
## 7 IL 1897 4.16 57.5
## 8 PA 1861 4.08 61.5
## 9 GA 1684 3.70 65.2
## 10 NC 1571 3.45 68.7
## 11 OH 1478 3.24 71.9
## 12 MI 1276 2.80 74.7
## 13 MA 1164 2.55 77.3
## 14 VA 1158 2.54 79.8
## 15 MD 991 2.18 82.0
## 16 MO 917 2.01 84.0
## 17 WA 883 1.94 85.9
## 18 TN 876 1.92 87.9
## 19 AZ 872 1.91 89.8
## 20 CO 869 1.91 91.7
## 21 IN 769 1.69 93.4
## 22 SC 724 1.59 95.0
## 23 MN 693 1.52 96.5
## 24 WI 572 1.26 97.7
## 25 LA 537 1.18 98.9
## 26 NV 492 1.08 100.0
## var en mi ig gr
## 1 heart_disease_severity 1.85 0.995 0.995084 0.539066
## 2 thal 2.03 0.209 0.209455 0.168046
## 3 exer_angina 1.77 0.139 0.139139 0.152639
## 4 exter_angina 1.77 0.139 0.139139 0.152639
## 5 chest_pain 2.53 0.205 0.205019 0.118029
## 6 num_vessels_flour 2.38 0.182 0.181522 0.115774
## 7 slope 2.18 0.112 0.112422 0.086880
## 8 serum_cholestoral 7.48 0.561 0.560556 0.079556
## 9 gender 1.84 0.057 0.057254 0.063297
## 10 oldpeak 4.87 0.249 0.249167 0.060358
## 11 max_heart_rate 6.83 0.334 0.333617 0.054070
## 12 resting_blood_pressure 5.57 0.143 0.142555 0.030239
## 13 age 5.93 0.137 0.137175 0.027055
## 14 resting_electro 2.06 0.024 0.024148 0.022194
## 15 fasting_blood_sugar 1.60 0.000 0.000459 0.000758
ggplot(variable_importance, aes(x = reorder(var, gr), y = gr, fill = var)) +
  geom_bar(stat = "identity") + coord_flip() + theme_bw() +
  xlab("") + ylab("Variable Importance (based on Information Gain)") + guides(fill = FALSE)
# Is heart_disease_severity the feature that explains the target the most?
# No, this variable was used to generate the target, thus we must exclude it.
heart_disease$oldpeak_2 <- equal_freq(var = heart_disease$oldpeak, n_bins = 3)
summary(heart_disease$oldpeak_2)
## [0.0,0.2) [0.2,1.5) [1.5,6.2]
## 106 107 90
vars_to_analyze <- c("age", "oldpeak", "max_heart_rate")
cross_plot(data = heart_disease, target = "has_heart_disease", input = vars_to_analyze)
Overall impression - the best overall EDA package, with functionality beyond EDA.
rpivotTable
Pretty cool - an interactive pivot table, used here to cross-tabulate predicted vs. actual targets (precision/recall)!
library(rpivotTable)
# Read the data
data <- read.delim(file = "https://goo.gl/ac5AkG", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
data$predicted_target <- ifelse(data$score >= 0.5, "yes", "no")
rpivotTable(data = data, rows = "predicted_target", cols = "target", aggregatorName = "Count",
            rendererName = "Table", width = "100%", height = "400px")
Overall impression - useful in limited cases (even though it is pretty cool). Might be useful to include in a shiny app.
XDA
#devtools::install_local("C:\\Users\\czbs7d\\Documents\\R\\packagesGithub\\xda-master.zip")
library(xda)
numSummary(myData)
## n mean sd max min range nunique
## self_reported_fico 45541 711 45.4 761 579.0 182 13
## loan_to_value 45544 289 37733.1 8000000 0.0 8000000 7392
## loan_amount 45539 305461 382271.3 9900000 0.0 9900000 4386
## time_lead_to_call 45562 2704 12476.5 350416 3.3 350413 18078
## nzeros iqr lowerbound upperbound noutlier
## self_reported_fico 0 70.0 575.0 855 0
## loan_to_value 17 17.2 51.9 121 3688
## loan_amount 8 230497.0 -216219.0 705746 2969
## time_lead_to_call 0 1133.1 -1411.4 3121 5797
## kurtosis skewness mode miss miss% 1% 5%
## self_reported_fico -0.786 -0.743 750 21 0.0461 620.0 620.0
## loan_to_value 44362.651 209.529 80 18 0.0395 17.8 43.6
## loan_amount 143.645 9.201 60000 23 0.0505 60000.0 64000.0
## time_lead_to_call 232.304 13.462 246 0 0.0000 241.3 245.1
## 25% 50% 75% 95% 99%
## self_reported_fico 680.0 720.0 750 761.0 761
## loan_to_value 77.8 86.1 95 99.4 100
## loan_amount 129526.5 210499.0 360000 792000.0 1500000
## time_lead_to_call 288.3 582.4 1421 7359.5 42506
## n miss miss% unique
## loan_purpose 45562 0 0.0000 3
## property_state 45522 40 0.0878 52
## existing_customer 45562 0 0.0000 3
## time_to_purchase 45562 0 0.0000 5
## eventually_fund 45562 0 0.0000 2
## lead_source_code 45562 0 0.0000 33
## property_state2 45562 0 0.0000 26
## top5levels:count
## loan_purpose Purchase:35680, REFI:9862, Unknown:20
## property_state CA:4860, NY:3971, TX:3820, FL:3741, NJ:1936
## existing_customer Yes:24216, Unknown:13011, No:8335
## time_to_purchase Unknown:20239, N/A:9862, Immediately:6541, Soon:5567, Later:3353
## eventually_fund No:44887, Yes:675
## lead_source_code Web - Contact Us form:15753, Creditqual-Email-Web:8850, Bankrate:6585, Paid-Search-Web:5074, Marketing-Email-Web:3663
## property_state2 other:5950, CA:4860, NY:3971, TX:3820, FL:3741
Overall impression - does not do anything better than the other packages reviewed.
skimr
## Skim summary statistics
## n obs: 45562
## n variables: 12
##
## -- Variable type:character --------------------------------------------------------------------------------------------------------------------------------------------------
## variable missing complete n min max empty n_unique
## eventually_fund 0 45562 45562 2 3 0 2
## existing_customer 0 45562 45562 2 7 0 3
## lead_source_code 0 45562 45562 6 30 0 33
## loan_purpose 0 45562 45562 4 8 0 3
## property_state 40 45522 45562 2 2 0 51
## property_state2 0 45562 45562 2 5 0 26
## time_to_purchase 0 45562 45562 3 11 0 5
##
## -- Variable type:numeric ----------------------------------------------------------------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0
## loan_amount 23 45539 45562 305460.9 382271.29 0
## loan_to_value 18 45544 45562 289.03 37733.07 0
## self_reported_fico 21 45541 45562 710.96 45.38 579
## time_lead_to_call 0 45562 45562 2703.54 12476.47 3.3
## p25 p50 p75 p100 hist
## 129526.5 210499 360000 9900000 ▇▁▁▁▁▁▁▁
## 77.78 86.06 95 8000000 ▇▁▁▁▁▁▁▁
## 680 720 750 761 ▁▁▂▂▁▂▂▇
## 288.3 582.4 1421.4 350416 ▇▁▁▁▁▁▁▁
##
## -- Variable type:POSIXct ----------------------------------------------------------------------------------------------------------------------------------------------------
## variable missing complete n min max median
## lead_created_date 0 45562 45562 2017-12-31 2018-10-22 2018-05-12
## n_unique
## 41813
Skim summary statistics
n obs: 45562
n variables: 12
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| eventually_fund | 0 | 45562 | 45562 | 2 | 3 | 0 | 2 |
| existing_customer | 0 | 45562 | 45562 | 2 | 7 | 0 | 3 |
| lead_source_code | 0 | 45562 | 45562 | 6 | 30 | 0 | 33 |
| loan_purpose | 0 | 45562 | 45562 | 4 | 8 | 0 | 3 |
| property_state | 40 | 45522 | 45562 | 2 | 2 | 0 | 51 |
| property_state2 | 0 | 45562 | 45562 | 2 | 5 | 0 | 26 |
| time_to_purchase | 0 | 45562 | 45562 | 3 | 11 | 0 | 5 |
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| loan_amount | 23 | 45539 | 45562 | 305460.9 | 382271.29 | 0 | 129526.5 | 210499 | 360000 | 9900000 | ▇▁▁▁▁▁▁▁ |
| loan_to_value | 18 | 45544 | 45562 | 289.03 | 37733.07 | 0 | 77.78 | 86.06 | 95 | 8000000 | ▇▁▁▁▁▁▁▁ |
| self_reported_fico | 21 | 45541 | 45562 | 710.96 | 45.38 | 579 | 680 | 720 | 750 | 761 | ▁▁▂▂▁▂▂▇ |
| time_lead_to_call | 0 | 45562 | 45562 | 2703.54 | 12476.47 | 3.3 | 288.3 | 582.4 | 1421.4 | 350416 | ▇▁▁▁▁▁▁▁ |
| variable | missing | complete | n | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| lead_created_date | 0 | 45562 | 45562 | 2017-12-31 | 2018-10-22 | 2018-05-12 | 41813 |
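Both views above come from the same function; a minimal sketch, assuming the skimr package is installed:

```r
library(skimr)

skim(myData)   # prints the console summary shown above

# The table-formatted view was produced by rendering the skim results for
# R Markdown; in the skimr version apparently used here, skimr's kable
# support did this directly (an assumption about the version).
```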
Overall impression - good when you need a compact view or summary statistics.
PerformanceAnalytics
PerformanceAnalytics is a library of functions designed for evaluating the performance and risk characteristics of financial assets or funds. It does not apply to more general use.
descriptr
This is one of the newest EDA packages. I was made aware of it on 3/5/19.
https://descriptr.rsquaredacademy.com/
- Summary statistics
- Two way tables
- One way table
- Group wise summary
- Multiple variable statistics
- Multiple one way tables
- Multiple two way tables
I will concentrate on the visualizations supported in the package.
There is a Shiny version too!
ds_launch_shiny_app()
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
The ds_screener() function will screen data frames and return details such as variable names, class, levels, and missing values. The plot.screener() function creates bar plots to visualize the % of missing observations for each variable in a data frame.
## -----------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## -----------------------------------------------------------------------
## | mpg | numeric | NA | 0 | 0 |
## | cyl | factor | 4 6 8 | 0 | 0 |
## | disp | numeric | NA | 0 | 0 |
## | hp | numeric | NA | 0 | 0 |
## | drat | numeric | NA | 0 | 0 |
## | wt | numeric | NA | 0 | 0 |
## | qsec | numeric | NA | 0 | 0 |
## | vs | factor | 0 1 | 0 | 0 |
## | am | factor | 0 1 | 0 | 0 |
## | gear | factor | 3 4 5 | 0 | 0 |
## | carb | factor |1 2 3 4 6 8| 0 | 0 |
## -----------------------------------------------------------------------
##
## Overall Missing Values 0
## Percentage of Missing Values 0 %
## Rows with Missing Values 0
## Columns With Missing Values 0
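The screen above was generated from a factor-converted copy of mtcars; a sketch, assuming the descriptr package is installed:

```r
library(descriptr)

# Rebuild the factor-converted mtcars shown in the structure above
# (descriptr also ships a similar data set, mtcarz -- an assumption worth checking).
mtcarz <- mtcars
facs <- c("cyl", "vs", "am", "gear", "carb")
mtcarz[facs] <- lapply(mtcarz[facs], factor)

ds_screener(mtcarz)   # prints the screening table shown above
```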
Overall Impression:
inspectdf
Data
## Parsed with column specification:
## cols(
## ID = col_double(),
## Gender = col_character(),
## Grade = col_double(),
## Horoscope = col_character(),
## Subject = col_character(),
## IntExt = col_character(),
## OptPest = col_character(),
## ScreenTime = col_double(),
## Sleep = col_double(),
## PhysActive = col_double(),
## HrsHomework = col_double(),
## SpendTime1 = col_character(),
## SpendTime2 = col_character(),
## Self1 = col_character(),
## Self2 = col_character(),
## Career = col_character(),
## Superpower = col_character()
## )
## # A tibble: 10 x 17
## ID Gender Grade Horoscope Subject IntExt OptPest ScreenTime Sleep
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 male 4 Scorpio Math Extra~ Optimi~ 1 7
## 2 2 female 4 Capricorn Gym Extra~ Optimi~ 1 8
## 3 3 male 4 Taurus Math Intro~ Optimi~ 4 9
## 4 4 male 4 Aquarius Math Don't~ Don't ~ 3 9
## 5 5 male 4 Scorpio Gym Don't~ Don't ~ 1 9
## 6 6 male 4 Pisces Gym Extra~ Optimi~ 2 9
## 7 7 male 3 Scorpio Art Intro~ Optimi~ 1 11
## 8 8 male 6 Taurus Math Extra~ Optimi~ 4 9
## 9 9 male 6 Aries Gym Intro~ Pessim~ 6 8
## 10 10 male 6 Pisces Math Intro~ Don't ~ 3 9
## # ... with 8 more variables: PhysActive <dbl>, HrsHomework <dbl>,
## # SpendTime1 <chr>, SpendTime2 <chr>, Self1 <chr>, Self2 <chr>,
## # Career <chr>, Superpower <chr>
## [1] 185 17
We need three data frames: one with the complete data set (we simply rename df to allGrades), and two subsets to leverage the package's easy data-frame comparison features. We create the data frames oldGrades (grades 6-8) and youngGrades (grades 3-5).
allGrades <- df
oldGrades <- allGrades %>% filter(Grade > 5)
youngGrades <- allGrades %>% filter(Grade < 6)
ggplot(oldGrades, aes(x = Grade)) + geom_histogram()
inspectdf Functions
Simply pass in a data frame - or two, for comparisons - and set show_plot = TRUE. The output includes both a tibble with the raw data and a visualization.
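A sketch of the calling pattern, assuming the inspectdf package (piping to show_plot() is the replacement for the deprecated show_plot = TRUE argument):

```r
library(inspectdf)
library(dplyr)

# One data frame: returns a summary tibble; pipe to show_plot() for the chart.
inspect_types(allGrades) %>% show_plot()

# Two data frames: column-by-column comparison of the same summaries.
inspect_na(youngGrades, oldGrades) %>% show_plot()
```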
inspect_types() function
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 2 x 4
## type cnt pcnt col_name
## <chr> <int> <dbl> <list>
## 1 character 11 64.7 <chr [11]>
## 2 numeric 6 35.3 <chr [6]>
Compare between youngGrades and oldGrades
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 2 x 5
## type cnt_1 pcnt_1 cnt_2 pcnt_2
## <chr> <int> <dbl> <int> <dbl>
## 1 character 11 64.7 11 64.7
## 2 numeric 6 35.3 6 35.3
inspect_mem() function
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 17 x 3
## col_name size pcnt
## <chr> <chr> <dbl>
## 1 Career 8.04 Kb 12.8
## 2 SpendTime2 7.88 Kb 12.5
## 3 Superpower 7.8 Kb 12.4
## 4 Self2 7.27 Kb 11.6
## 5 Self1 7.21 Kb 11.5
## 6 SpendTime1 6.84 Kb 10.9
## 7 Horoscope 2.17 Kb 3.45
## 8 Subject 1.77 Kb 2.80
## 9 IntExt 1.68 Kb 2.67
## 10 OptPest 1.68 Kb 2.67
## 11 Gender 1.66 Kb 2.64
## 12 ID 1.49 Kb 2.37
## 13 Grade 1.49 Kb 2.37
## 14 ScreenTime 1.49 Kb 2.37
## 15 Sleep 1.49 Kb 2.37
## 16 PhysActive 1.49 Kb 2.37
## 17 HrsHomework 1.49 Kb 2.37
Compare between youngGrades and oldGrades
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 17 x 5
## col_name size_1 size_2 pcnt_1 pcnt_2
## <chr> <chr> <chr> <dbl> <dbl>
## 1 SpendTime2 4.09 Kb 4.79 Kb 12.9 12.2
## 2 Career 3.91 Kb 5.33 Kb 12.3 13.6
## 3 Superpower 3.8 Kb 4.95 Kb 12.0 12.6
## 4 Self1 3.65 Kb 4.51 Kb 11.5 11.5
## 5 Self2 3.61 Kb 5.02 Kb 11.4 12.8
## 6 SpendTime1 3.5 Kb 3.87 Kb 11.0 9.85
## 7 Horoscope 1.38 Kb 1.52 Kb 4.34 3.88
## 8 Subject 992 bytes 1.12 Kb 3.06 2.85
## 9 IntExt 904 bytes 1.03 Kb 2.78 2.63
## 10 OptPest 904 bytes 1.03 Kb 2.78 2.63
## 11 Gender 888 bytes 1.02 Kb 2.74 2.59
## 12 ID 712 bytes 864 bytes 2.19 2.15
## 13 Grade 712 bytes 864 bytes 2.19 2.15
## 14 ScreenTime 712 bytes 864 bytes 2.19 2.15
## 15 Sleep 712 bytes 864 bytes 2.19 2.15
## 16 PhysActive 712 bytes 864 bytes 2.19 2.15
## 17 HrsHomework 712 bytes 864 bytes 2.19 2.15
inspect_na() function
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 17 x 3
## col_name cnt pcnt
## <chr> <dbl> <dbl>
## 1 Superpower 8 4.32
## 2 Self1 2 1.08
## 3 Career 2 1.08
## 4 Subject 1 0.541
## 5 PhysActive 1 0.541
## 6 Self2 1 0.541
## 7 ID 0 0
## 8 Gender 0 0
## 9 Grade 0 0
## 10 Horoscope 0 0
## 11 IntExt 0 0
## 12 OptPest 0 0
## 13 ScreenTime 0 0
## 14 Sleep 0 0
## 15 HrsHomework 0 0
## 16 SpendTime1 0 0
## 17 SpendTime2 0 0
Compare between youngGrades and oldGrades
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 17 x 6
## col_name cnt_1 pcnt_1 cnt_2 pcnt_2 p_value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Superpower 3 3.61 5 4.90 0.948
## 2 Self1 2 2.41 0 0 0.389
## 3 Career 2 2.41 0 0 0.389
## 4 Subject 1 1.20 0 0 0.918
## 5 PhysActive 1 1.20 0 0 0.918
## 6 Self2 1 1.20 0 0 0.918
## 7 ID 0 0 0 0 NA
## 8 Gender 0 0 0 0 NA
## 9 Grade 0 0 0 0 NA
## 10 Horoscope 0 0 0 0 NA
## 11 IntExt 0 0 0 0 NA
## 12 OptPest 0 0 0 0 NA
## 13 ScreenTime 0 0 0 0 NA
## 14 Sleep 0 0 0 0 NA
## 15 HrsHomework 0 0 0 0 NA
## 16 SpendTime1 0 0 0 0 NA
## 17 SpendTime2 0 0 0 0 NA
inspect_num() function
## The `show_plot = TRUE` is deprecated and will be removed in a future version. The `show_plot()` function should be used instead. For more info, check out the help file ?show_plot()
## # A tibble: 6 x 10
## col_name min q1 median mean q3 max sd pcnt_na hist
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 Grade 3 5 6 5.74 7 8 1.39 0 <tibble [2~
## 2 HrsHomewo~ 0 1 3 4.17 6 35 4.79 0 <tibble [2~
## 3 ID 1 47 93 93 139 185 53.5 0 <tibble [2~
## 4 PhysActive 0 6 9 11.5 12 82 11.8 0.541 <tibble [1~
## 5 ScreenTime 0 1 3 3.00 4 18 2.32 0 <tibble [2~
## 6 Sleep 2 8 9 8.64 10 12 1.54 0 <tibble [2~
Compare between youngGrades and oldGrades
## # A tibble: 6 x 5
## col_name hist_1 hist_2 jsd fisher_p
## <chr> <list> <list> <dbl> <dbl>
## 1 Grade <tibble [22 x 2]> <tibble [22 x 2]> 1 0.0001000
## 2 HrsHomework <tibble [23 x 2]> <tibble [23 x 2]> 0.0837 0.203
## 3 ID <tibble [20 x 2]> <tibble [20 x 2]> 0.471 0.0001000
## 4 PhysActive <tibble [19 x 2]> <tibble [19 x 2]> 0.0626 0.189
## 5 ScreenTime <tibble [16 x 2]> <tibble [16 x 2]> 0.122 0.000300
## 6 Sleep <tibble [20 x 2]> <tibble [20 x 2]> 0.132 0.000800
inspect_imb() function to identify factors which might be overly prevalent.
## # A tibble: 11 x 4
## col_name value pcnt cnt
## <chr> <chr> <dbl> <int>
## 1 OptPest Optimist 53.0 98
## 2 IntExt Extravert 51.4 95
## 3 Gender female 49.7 92
## 4 Subject Gym 37.3 69
## 5 Horoscope Aries 11.4 21
## 6 SpendTime1 video games 11.4 21
## 7 Self1 funny 9.73 18
## 8 Career lawyer 7.03 13
## 9 SpendTime2 reading 6.49 12
## 10 Superpower fly 5.95 11
## 11 Self2 active 4.86 9
Compare between youngGrades and oldGrades
## # A tibble: 11 x 7
## col_name value pcnt_1 cnt_1 pcnt_2 cnt_2 p_value
## <chr> <chr> <dbl> <int> <dbl> <int> <dbl>
## 1 OptPest Optimist 54.2 45 52.0 53 0.875
## 2 IntExt Extravert 51.8 43 51.0 52 1.000
## 3 Gender female 49.4 41 50 51 1.000
## 4 Subject Gym 38.6 32 36.3 37 0.868
## 5 Horoscope Scorpio 15.7 13 NA NA NA
## 6 SpendTime1 video games 15.7 13 7.84 8 0.151
## 7 Self1 funny 10.8 9 8.82 9 0.832
## 8 SpendTime2 reading 9.64 8 NA NA NA
## 9 Self2 active 8.43 7 NA NA NA
## 10 Superpower fly 8.43 7 NA NA NA
## 11 Career engineer 7.23 6 NA NA NA
inspect_cat() function
## # A tibble: 11 x 5
## col_name cnt common common_pcnt levels
## <chr> <int> <chr> <dbl> <list>
## 1 Career 107 lawyer 7.03 <tibble [107 x 3]>
## 2 Gender 3 female 49.7 <tibble [3 x 3]>
## 3 Horoscope 12 Aries 11.4 <tibble [12 x 3]>
## 4 IntExt 3 Extravert 51.4 <tibble [3 x 3]>
## 5 OptPest 3 Optimist 53.0 <tibble [3 x 3]>
## 6 Self1 100 funny 9.73 <tibble [100 x 3]>
## 7 Self2 102 active 4.86 <tibble [102 x 3]>
## 8 SpendTime1 90 video games 11.4 <tibble [90 x 3]>
## 9 SpendTime2 107 reading 6.49 <tibble [107 x 3]>
## 10 Subject 6 Gym 37.3 <tibble [6 x 3]>
## 11 Superpower 99 fly 5.95 <tibble [99 x 3]>
Compare between youngGrades and oldGrades
## # A tibble: 11 x 5
## col_name jsd fisher_p lvls_1 lvls_2
## <chr> <dbl> <dbl> <list> <list>
## 1 Career 0.591 0.262 <tibble [53 x 3]> <tibble [74 x 3]>
## 2 Gender 0.000102 1 <tibble [3 x 3]> <tibble [3 x 3]>
## 3 Horoscope 0.0590 0.193 <tibble [12 x 3]> <tibble [12 x 3]>
## 4 IntExt 0.0281 0.0300 <tibble [3 x 3]> <tibble [3 x 3]>
## 5 OptPest 0.000386 0.980 <tibble [3 x 3]> <tibble [3 x 3]>
## 6 Self1 0.602 0.0188 <tibble [52 x 3]> <tibble [64 x 3]>
## 7 Self2 0.582 0.0784 <tibble [52 x 3]> <tibble [73 x 3]>
## 8 SpendTime1 0.620 0.000400 <tibble [47 x 3]> <tibble [51 x 3]>
## 9 SpendTime2 0.656 0.0103 <tibble [57 x 3]> <tibble [66 x 3]>
## 10 Subject 0.00837 0.935 <tibble [6 x 3]> <tibble [5 x 3]>
## 11 Superpower 0.563 0.126 <tibble [50 x 3]> <tibble [65 x 3]>
inspect_cor() function
Why inspect_cor() over base cor()?
- cor() requires numeric inputs only
- Correlation matrices are hard to read
- cor() doesn’t produce confidence intervals
- cor() and cor.test() don’t provide visualisation methods out of the box
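For reference, the comparison in this section can be reproduced with calls along these lines (a sketch, assuming the data frames named in this document):

```r
library(inspectdf)

# Tidy correlations with confidence intervals for a single data frame
inspect_cor(myData) %>% show_plot()

# Compare correlations between the two grade-based subsets
inspect_cor(youngGrades, oldGrades) %>% show_plot()
```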
## # A tibble: 15 x 6
## col_1 col_2 corr p_value lower upper
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sleep ScreenTime -0.481 4.38e-12 -0.598 -0.343
## 2 ScreenTime Grade 0.413 5.19e- 9 0.266 0.541
## 3 Sleep Grade -0.331 4.08e- 6 -0.470 -0.176
## 4 HrsHomework ScreenTime 0.219 2.70e- 3 0.0568 0.371
## 5 PhysActive Grade -0.192 8.97e- 3 -0.346 -0.0280
## 6 PhysActive ScreenTime -0.176 1.71e- 2 -0.331 -0.0109
## 7 HrsHomework ID -0.145 4.89e- 2 -0.302 0.0201
## 8 HrsHomework Grade 0.122 9.87e- 2 -0.0437 0.281
## 9 HrsHomework PhysActive 0.115 1.21e- 1 -0.0513 0.275
## 10 Sleep ID -0.0629 3.95e- 1 -0.225 0.103
## 11 HrsHomework Sleep -0.0586 4.28e- 1 -0.221 0.107
## 12 ScreenTime ID 0.0445 5.47e- 1 -0.121 0.208
## 13 PhysActive ID 0.0284 7.02e- 1 -0.137 0.193
## 14 Grade ID 0.0217 7.70e- 1 -0.143 0.186
## 15 PhysActive Sleep 0.0148 8.42e- 1 -0.151 0.179
Compare between youngGrades and oldGrades
## # A tibble: 15 x 5
## col_1 col_2 corr_1 corr_2 p_value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ScreenTime Grade 0.500 0.255 0.0550
## 2 Sleep Grade -0.322 -0.142 0.204
## 3 Sleep ScreenTime -0.291 -0.506 0.0869
## 4 HrsHomework PhysActive 0.268 -0.0336 0.0403
## 5 PhysActive Grade -0.224 -0.0621 0.272
## 6 ScreenTime ID 0.174 -0.0300 0.171
## 7 PhysActive ScreenTime -0.169 -0.185 0.911
## 8 Grade ID 0.163 -0.109 0.0685
## 9 HrsHomework ScreenTime 0.121 0.236 0.427
## 10 HrsHomework Grade 0.0963 0.0887 0.959
## 11 HrsHomework ID -0.0569 -0.209 0.302
## 12 PhysActive Sleep -0.0343 -0.0248 0.949
## 13 HrsHomework Sleep -0.0223 -0.0443 0.883
## 14 Sleep ID 0.00728 -0.0993 0.477
## 15 PhysActive ID 0.0000702 0.119 0.425
Overall Impression
I like this package because it has a lot of functionality and is incredibly straightforward. In short, it allows you to understand and visualize column types, sizes, values, value imbalance and distributions, as well as correlations. Further, any of the above can be run very easily on an individual data frame, or used to compare the differences between two data frames.
Miscellaneous Code
No Variance
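A minimal base-R sketch for flagging zero-variance columns (the `no_var_cols` name is my own, not from the original):

```r
# Identify columns with at most one distinct non-missing value;
# these carry no information for modelling
no_var_cols <- names(dataset)[sapply(dataset, function(x) length(unique(na.omit(x))) <= 1)]
no_var_cols

# Optionally drop them
# dataset <- dataset %>% dplyr::select(-dplyr::one_of(no_var_cols))
```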
Collect Variables by Type
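One way to collect column names by storage type, using the same select_if pattern as elsewhere in this document (the variable names here are my own):

```r
library(dplyr)

num_cols  <- colnames(select_if(dataset, is.numeric))    # numeric columns
char_cols <- colnames(select_if(dataset, is.character))  # character columns
fact_cols <- colnames(select_if(dataset, is.factor))     # factor columns
```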
Potential ID Columns
## Parse column name to look for ID to suggest ID columns
myIDs <- filter(df_status(dataset), grepl("id", variable, ignore.case = TRUE)) %>% select(variable, type, unique)
myIDs[,1]
# ID variables rarely useful in algorithm development. Consider removing from dataset.
if(nrow(myIDs) > 0){dataset <- dataset %>% select(-contains("id"))}
Columns to Factors
# Adjust number in filter to change unique count cutoff limit
## Identify columns to be converted to factors
temp <- dataset[colnames(dataset[,sapply(dataset,is.character)])] %>% summarise_all(funs(n_distinct(.)))
factor_columns <- melt(as.data.frame(temp)) %>% filter(value <= 15) %>% select(variable) %>% .[["variable"]] %>% as.character()
## Convert columns to factors if less than 15 unique values
dataset[factor_columns] <- lapply(dataset[factor_columns], factor)
rm(temp)
Missing Data Patterns
mice_plot <- aggr(dataset, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(dataset), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
# Percentage of Missing Values Per Feature
plot_missing(dataset)
# Drop specific columns with missing data
dataset <- dataset[ , !(names(dataset) %in% c("Column1", "Column2"))]
Imputed Data
MICE
MICE assumes that the missing data are Missing at Random (MAR), meaning the probability that a value is missing depends only on the observed values and can be predicted from them. It imputes data on a variable-by-variable basis by specifying an imputation model per variable.
For example: suppose we have variables X1, X2, ..., Xk. If X1 has missing values, it is regressed on the other variables X2 to Xk, and its missing values are replaced with the predicted values. Similarly, if X2 has missing values, then X1 and X3 to Xk are used as independent variables in the prediction model, and X2's missing values are replaced with the predictions.
By default, linear regression is used to predict continuous missing values and logistic regression for categorical missing values. Once this cycle is complete, multiple data sets are generated; they differ only in the imputed missing values. Generally, it's considered good practice to build models on these data sets separately and combine the results.
# Impute missing data
imputed_Data <- mice(dataset, m=5, maxit = 50, method = 'pmm', seed = 500)
summary(imputed_Data)
# Create density plots of imputed variables. You want the distributions to be approximately equal. Blue represents actual observations.
densityplot(imputed_Data)
# Add the inputed data back into the dataset to complete it
dataset_mice <- complete(imputed_Data, 1)
Impute With missForest
missForest is an implementation of the random forest algorithm. It's a non-parametric imputation method applicable to various variable types. It builds a random forest model for each variable, then uses that model to predict the variable's missing values from the observed values.
Since bagging works well on categorical variables too, we don't need to remove them here.
It appears:
- character variables cause missForest to error
- missForest is not a good solution when a column has few distinct values
## Interesting: the package's prodNA function can be used to randomly introduce NAs
dataset_mis <- prodNA(dataset, noNA = 0.1)
dataset_mis <- dataset_mis %>% mutate_if(is.character, as.factor)
sum(is.na(dataset_mis))
imputed_Data_missF <- missForest(as.data.frame(dataset_mis))
sum(is.na(imputed_Data_missF$ximp))
rm(dataset_mis)
Hmisc
aregImpute() allows mean imputation using additive regression, bootstrapping, and predictive mean matching.
In bootstrapping, different bootstrap resamples are used for each of multiple imputations. Then, a flexible additive model (non parametric regression method) is fitted on samples taken with replacements from original data and missing values (acts as dependent variable) are predicted using non-missing values (independent variable).
Then, it uses predictive mean matching (the default) to impute missing values. Predictive mean matching works well for continuous and categorical (binary & multi-level) variables.
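A hedged sketch of the aregImpute() workflow; the columns used here (Age, Fare, Embarked) are hypothetical placeholders matching the Titanic-style names used elsewhere in this document:

```r
library(Hmisc)

# Fit the additive-regression imputation model with 5 multiple imputations
imp <- aregImpute(~ Age + Fare + Embarked, data = dataset, n.impute = 5)
imp

# Extract one completed set of imputed values
imputed <- impute.transcan(imp, data = dataset, imputation = 1,
                           list.out = TRUE, pr = FALSE)
```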
Factor Relevel
Examine and re-order factors in the dataset
# Show all levels of a factor
levels(dataset$Embarked)
# Manually order factors
dataset$Embarked <- factor(dataset$Embarked, levels = c("S", "Q", "C"))
# Make a specific factor first
dataset$Embarked <- relevel(dataset$Embarked, "Q")
# Reverse the order of the factor
dataset$Embarked <- factor(dataset$Embarked, levels=rev(levels(dataset$Embarked)))
Outlier Management
Change the declared y-value to the column of interest
for (i in seq_along(numeric_columns)){
print(numeric_columns[i])
tempplot <- ggplot(dataset, aes_string(x = "factor(0)", y = numeric_columns[i])) + geom_boxplot(outlier.color="red", outlier.shape=8, outlier.size=4) + scale_y_continuous()
print(tempplot)
}
Below, change the variable cutOff to meet your needs. cutOff determines the number of unique values a numeric column needs in order to produce a boxplot.
cutOff = 10 #Number of unique values required to output a boxplot
tmpDF <- dataset[, sapply(dataset, is.numeric)] #get only numeric columns
tmpDF <- na.omit(tmpDF) #just a check to avoid ggplot messages
#keep cols that meet cutOff requirement
tmpDF <- tmpDF[, sapply(tmpDF, function(col) length(unique(col))) >= cutOff]
tmpDF <- data.frame(tmpDF)
#https://stackoverflow.com/questions/15027659/how-do-you-draw-a-boxplot-without-specifying-x-axis
plot_data = function (data, column)
ggplot(data = data, aes_string(x = factor(0), y = column)) +
geom_boxplot(outlier.color="red", outlier.shape=8, outlier.size=2) + xlab(column) + ylab("")
myplots <- lapply(colnames(tmpDF), plot_data, data = tmpDF)
myplots
rm(cutOff, tmpDF, myplots)
Custom Outlier Function
outlierKD <- function(dt, var) {
var_name <- eval(substitute(var),eval(dt))
na1 <- sum(is.na(var_name))
m1 <- mean(var_name, na.rm = T)
par(mfrow=c(2, 2), oma=c(0,0,3,0))
boxplot(var_name, main="With outliers")
hist(var_name, main="With outliers", xlab=NA, ylab=NA)
outlier <- boxplot.stats(var_name)$out
mo <- mean(outlier)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
boxplot(var_name, main="Without outliers")
hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
title("Outlier Check", outer=TRUE)
na2 <- sum(is.na(var_name))
cat("Outliers identified:", na2 - na1, "\n")
cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "\n")
cat("Mean of the outliers:", round(mo, 2), "\n")
m2 <- mean(var_name, na.rm = T)
cat("Mean without removing outliers:", round(m1, 2), "\n")
cat("Mean if we remove outliers:", round(m2, 2), "\n")
# response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
# if(response == "y" | response == "yes"){
# dt[as.character(substitute(var))] <- invisible(var_name)
# assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
# cat("Outliers successfully removed", "\n")
# return(invisible(dt))
# } else{
# cat("Nothing changed", "\n")
# return(invisible(var_name))
# }
}
numeric_cols <- colnames(select_if(dataset, is.numeric))
for (i in seq_along(numeric_cols)){
outlierKD(dataset, get(numeric_cols[i])) }
Examine the top 25 rows of a column containing outliers; change the column name to the column of interest.
Outlier Handling Options
The Tukey method formula: the bottom threshold is Q1 - 3(IQR); all values below are considered outliers. The top threshold is Q3 + 3(IQR); all values above are considered outliers.
The Hampel method formula: the bottom threshold is median_value - 3(mad_value); all values below are considered outliers. The top threshold is median_value + 3(mad_value); all values above are considered outliers.
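The two sets of thresholds can also be computed by hand in base R (a sketch using the Fare column; note funModeling's exact scaling constants may differ slightly):

```r
# Tukey thresholds: Q1 - 3*IQR and Q3 + 3*IQR
q <- quantile(dataset$Fare, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
tukey_bounds <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)

# Hampel thresholds: median +/- 3*MAD
med <- median(dataset$Fare, na.rm = TRUE)
m   <- mad(dataset$Fare, na.rm = TRUE)
hampel_bounds <- c(med - 3 * m, med + 3 * m)
```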
Change columns to columns of interest
# Tukey method outlier thresholds
tukey_outlier(dataset$Fare)
# Hampel method outlier thresholds
hampel_outlier(dataset$Fare)
Outliers Based on Thresholds
Change column names if using specific columns
## Set all values above the outlier threshold to the threshold value for all columns
prep_outliers(data = dataset, input = colnames(select_if(dataset,is.numeric)), method = "hampel", type='stop')
# Set all values above the outlier threshold to the threshold value for Specified columns
prep_outliers(data = dataset, input = c('Fare','Age'), method = "hampel", type='stop')
Manage Duplicates
Count Dupes
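A base-R sketch for counting fully duplicated rows:

```r
# Number of rows that duplicate an earlier row
sum(duplicated(dataset))

# Inspect the duplicated rows themselves
dataset[duplicated(dataset), ]
```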
Remove Duplicates
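Duplicates can be dropped with dplyr or base R (a sketch):

```r
# dplyr: keep only distinct rows
dataset <- dplyr::distinct(dataset)

# Base R equivalent
dataset <- dataset[!duplicated(dataset), ]
```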
Synthetic Data Generation
synthpop
The ‘synthpop’ package is great for synthesizing data for statistical disclosure control or for creating training data for model development. Other things to note:
- Synthesizing a single table is fast and simple.
- Watch out for over-fitting, particularly with factors that have many levels. Ensure the visit sequence is reasonable.
- You are not constrained by only the supported methods, you can build your own!
Sample data
original.df <- SD2011 %>% dplyr::select(sex, age, socprof, income, marital, depress, sport, nofriend,
smoke, nociga, alcabuse, bmi)
head(original.df)
## sex age socprof income marital depress sport nofriend smoke
## 1 FEMALE 57 RETIRED 800 MARRIED 6 NO 6 NO
## 2 MALE 20 PUPIL OR STUDENT 350 SINGLE 0 NO 4 NO
## 3 FEMALE 18 PUPIL OR STUDENT NA SINGLE 0 NO 20 NO
## 4 FEMALE 78 RETIRED 900 WIDOWED 16 YES 0 NO
## 5 FEMALE 54 SELF-EMPLOYED 1500 MARRIED 4 YES 6 YES
## 6 MALE 20 PUPIL OR STUDENT -8 SINGLE 5 NO 10 NO
## nociga alcabuse bmi
## 1 -8 NO 30.8
## 2 -8 NO 23.4
## 3 -8 NO 18.4
## 4 -8 NO 30.5
## 5 20 NO 20.0
## 6 -8 NO 23.9
The objective of synthesizing data is to generate a data set which resembles the original as closely as possible, warts and all, meaning also preserving the missing value structure. There are two ways to deal with missing values:
- impute/treat missing values before synthesis
- synthesize the missing values and deal with the missing later
The second option is generally better, since the purpose the data will support may influence how the missing values should be treated.
Missing values can be simply NA or some numeric code specified by the collection. A useful feature is that the syn function allows for different NA types; for example, the income, nofriend and nociga features use -8 as a missing value. A list is passed to the function in the following form.
# setting continuous variable NA list
cont.na.list <- list(income = c(NA, -8), nofriend = c(NA, -8), nociga = c(NA, -8))
Without this, the -8's would be treated as numeric values and could distort the synthesis. After synthesis, there is often a need to post-process the data to ensure it is logically consistent. For example, anyone who is married must be over 18, and anyone who doesn't smoke shouldn't have a value recorded for 'number of cigarettes consumed'. These rules can be applied during synthesis rather than needing ad hoc post-processing.
# apply rules to ensure consistency
rules.list <- list(marital = "age < 18", nociga = "smoke == 'NO'")
rules.value.list <- list(marital = "SINGLE", nociga = -8)
The variables in a condition need to be synthesized before the rule is applied, otherwise the function will throw an error. In this case age should be synthesized before marital, and smoke before nociga.
There is one person with a bmi of 450.
## sex age agegr placesize region edu
## 3953 FEMALE 58 45-59 URBAN 20,000-100,000 Lubelskie SECONDARY
## eduspec socprof unempdur income
## 3953 economy and administration LONG-TERM SICK/DISABLED -8 1300
## marital mmarr ymarr msepdiv ysepdiv ls depress
## 3953 MARRIED 4 1982 NA NA MOSTLY SATISFIED 1
## trust trustfam trustneigh sport nofriend smoke
## 3953 ONE CAN`T BE TOO CAREFUL YES YES YES 2 NO
## nociga alcabuse alcsol workab wkabdur wkabint wkabintdur emcc englang
## 3953 -8 NO NO NO -8 NO <NA> <NA> NONE
## height weight bmi
## 3953 149 NA 450
Their weight is missing from the data set, so this bmi cannot be computed from the recorded values, and I don't believe it is correct. So, any bmi over 75 (which is still very high) will be considered a missing value and corrected before synthesis.
# getting around the error: synthesis needs to occur before the rules are applied
original.df$bmi <- ifelse(original.df$bmi > 75, NA, original.df$bmi)
The data can now be synthesized using the following code.
# synthesise data
synth.obj <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, seed = myseed)
##
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.
## Synthesis
## -----------
## sex age socprof income marital depress sport nofriend smoke nociga
## alcabuse bmi
## Call:
## ($call) syn(data = original.df, rules = rules.list, rvalues = rules.value.list,
## cont.na = cont.na.list, seed = myseed)
##
## Number of synthesised data sets:
## ($m) 1
##
## First rows of synthesised data set:
## ($syn)
## sex age socprof income marital depress sport nofriend
## 1 MALE 50 FARMER -8 MARRIED 5 YES 4
## 2 MALE 29 LONG-TERM SICK/DISABLED 600 SINGLE 4 YES 3
## 3 FEMALE 24 PUPIL OR STUDENT NA SINGLE 2 NO 5
## 4 MALE 62 RETIRED 3200 MARRIED 15 YES 5
## 5 MALE 84 RETIRED 2121 WIDOWED 14 YES 1
## 6 MALE 75 RETIRED -8 MARRIED NA YES 4
## smoke nociga alcabuse bmi
## 1 NO -8 NO 29.3
## 2 YES 15 NO 22.2
## 3 NO -8 YES 19.7
## 4 YES 30 NO 23.2
## 5 NO -8 NO 25.8
## 6 NO -8 NO 27.7
## ...
##
## Synthesising methods:
## ($method)
## sex age socprof income marital depress sport nofriend
## "sample" "cart" "cart" "cart" "cart" "cart" "cart" "cart"
## smoke nociga alcabuse bmi
## "cart" "cart" "cart" "cart"
##
## Order of synthesis:
## ($visit.sequence)
## sex age socprof income marital depress sport nofriend
## 1 2 3 4 5 6 7 8
## smoke nociga alcabuse bmi
## 9 10 11 12
##
## Matrix of predictors:
## ($predictor.matrix)
## sex age socprof income marital depress sport nofriend smoke
## sex 0 0 0 0 0 0 0 0 0
## age 1 0 0 0 0 0 0 0 0
## socprof 1 1 0 0 0 0 0 0 0
## income 1 1 1 0 0 0 0 0 0
## marital 1 1 1 1 0 0 0 0 0
## depress 1 1 1 1 1 0 0 0 0
## sport 1 1 1 1 1 1 0 0 0
## nofriend 1 1 1 1 1 1 1 0 0
## smoke 1 1 1 1 1 1 1 1 0
## nociga 1 1 1 1 1 1 1 1 1
## alcabuse 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## nociga alcabuse bmi
## sex 0 0 0
## age 0 0 0
## socprof 0 0 0
## income 0 0 0
## marital 0 0 0
## depress 0 0 0
## sport 0 0 0
## nofriend 0 0 0
## smoke 0 0 0
## nociga 0 0 0
## alcabuse 1 0 0
## bmi 1 1 0
The compare function allows for easy checking of the synthesized data.
# compare the synthetic and original data frames
compare(synth.obj, original.df, nrow = 4, ncol = 3, cols = mycols)$plot
Solid. The distributions are very well preserved. Did the rules work on the smoking variable?
## nociga
## smoke -8 1 2 3 4 5 6 7 8 9 10 12 13
## YES 17 15 18 16 10 82 34 13 22 2 323 24 2
## NO 3692 0 0 0 0 0 0 0 0 0 0 0 0
## nociga
## smoke 14 15 16 17 18 20 21 24 25 29 30 35 40
## YES 5 144 1 1 9 431 1 1 48 1 44 2 33
## NO 0 0 0 0 0 0 0 0 0 0 0 0 0
## nociga
## smoke 50
## YES 1
## NO 0
They did. All non-smokers have missing values for the number of cigarettes consumed.
compare can also be used for model output checking. A logistic regression model will be fit to find the important predictors of depression. The depress variable ranges from 0-21. This will be converted to:
- 0-7 - no evidence of depression (0)
- 8-21 - evidence of depression (1)
This split leaves 3822 (0)’s and 1089 (1)’s for modelling.
#MODEL COMPARISON
glm1 <- glm.synds(ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + sport + nofriend + smoke +
alcabuse + bmi, data = synth.obj, family = "binomial")
summary(glm1)
## Warning: Note that all these results depend on the synthesis model being correct.
##
## Fit to synthetic data set with a single synthesis.
## Inference to coefficients and standard errors that
## would be obtained from the observed data.
##
## Call:
## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) +
## sport + nofriend + smoke + alcabuse + bmi, family = "binomial",
## data = synth.obj)
##
## Combined estimates:
## xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)
## (Intercept) -0.37376 0.71378 -0.52 0.6005
## sexFEMALE 0.26372 0.09979 2.64 0.0082 **
## age 0.09369 0.00385 24.36 < 0.0000000000000002 ***
## log(income) -0.79442 0.08859 -8.97 < 0.0000000000000002 ***
## sportNO -0.76773 0.12299 -6.24 0.00000000043 ***
## nofriend -0.02465 0.00845 -2.92 0.0035 **
## smokeNO -0.05681 0.11369 -0.50 0.6173
## alcabuseNO -0.46311 0.20668 -2.24 0.0250 *
## bmi 0.01152 0.01061 1.09 0.2776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call used to fit models to the data:
## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) +
## sport + nofriend + smoke + alcabuse + bmi, family = "binomial",
## data = synth.obj)
##
## Differences between results based on synthetic and observed data:
## Std. coef diff p value CI overlap
## (Intercept) -0.360 0.719 0.908
## sexFEMALE -0.666 0.505 0.830
## age 1.835 0.067 0.532
## log(income) -0.319 0.750 0.919
## sportNO -0.690 0.490 0.824
## nofriend -2.230 0.026 0.431
## smokeNO -1.320 0.187 0.663
## alcabuseNO 1.309 0.191 0.666
## bmi 0.314 0.754 0.920
##
## Measures for one synthesis and 9 coefficients
## Mean confidence interval overlap: 0.744
## Mean absolute std. coef diff: 1
## Lack-of-fit: 14; p-value 0.123 for test that synthesis model is compatible
## with a chi-squared test with 9 degrees of freedom
##
## Confidence interval plot:
While the model needs more work, the same conclusions would be drawn from both the original and synthetic data sets, as can be seen from the confidence intervals. Occasionally, contradictory conclusions may be reached about a variable, for example accepting it in the observed data but not in the synthetic data. This scenario could be corrected by using different synthesis methods or altering the visit sequence.
Preserving Count Data
Released population data are often counts of people in geographical areas by demographic variables (age, sex, etc.). Some cells in the table can be very small, e.g. <5, and for privacy reasons these cells are suppressed to protect people's identities. With synthetic data, suppression is not required, given it contains no real people, assuming there is enough uncertainty in how the records are synthesized.
The existence of small cell counts raises a few questions:
- If very few records exist in a particular grouping (1-4 records in an area) can they be accurately simulated by synthpop?
- Is the structure of the count data preserved?
To test this 200 areas will be simulated to replicate possible real world scenarios. Area size will be randomly allocated ensuring a good mix of large and small population sizes. Population sizes are randomly drawn from a Poisson distribution with mean \(\lambda\). If large, \(\lambda\) is drawn from a uniform distribution on the interval [20, 40]. If small, \(\lambda\) is set to 1.
# ---------- AREA
# add area flag to the data frame
area.label <- paste0("area", 1:200)
a <- sample(0:1, 200, replace = TRUE, prob = c(0.5, 0.5))
lambda <- runif(200, 20, 40)*(1-a) + a
prob.dist <- rpois(200, lambda)
area <- sample(area.label, 5000, replace = TRUE, prob = prob.dist)
# attached to original data frame
original.df <- SD2011 %>% dplyr::select(sex, age, socprof, income, marital, depress, sport, nofriend, smoke, nociga, alcabuse, bmi)
original.df$bmi <- ifelse(original.df$bmi > 75, NA, original.df$bmi)
original.df <- cbind(original.df, area) %>% arrange(area)
The sequence of synthesizing variables and the choice of predictors is important when there are rare events or low-sample areas. If area is synthesized very early in the procedure and used as a predictor for the following variables, the subsequent models are likely to over-fit. Synthetic data sets require a level of uncertainty to reduce the risk of statistical disclosure, so this is not ideal.
Fortunately syn allows for modification of the predictor matrix. To avoid over-fitting, 'area' is the last variable to be synthesized and will use only sex and age as predictors. This is reasonable to capture the key population characteristics. Additionally, syn throws an error unless maxfaclevels is changed to the number of areas (the default is 60). This is to prevent poorly synthesized data, and a warning message suggests checking the results, which is good practice.
# synthesise data
# m is set to 0 as a hack to set the synds object and the predictor matrix
synth.obj.b <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, maxfaclevels = 200,
seed = myseed, m = 0)
##
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.
Changing the predictor matrix to predict area with only age and sex
new.pred.mat <- synth.obj.b$predictor.matrix
new.pred.mat["area",] <- 0
new.pred.mat["area",c("age", "sex")] <- 1
new.pred.mat
## sex age socprof income marital depress sport nofriend smoke
## sex 0 0 0 0 0 0 0 0 0
## age 1 0 0 0 0 0 0 0 0
## socprof 1 1 0 0 0 0 0 0 0
## income 1 1 1 0 0 0 0 0 0
## marital 1 1 1 1 0 0 0 0 0
## depress 1 1 1 1 1 0 0 0 0
## sport 1 1 1 1 1 1 0 0 0
## nofriend 1 1 1 1 1 1 1 0 0
## smoke 1 1 1 1 1 1 1 1 0
## nociga 1 1 1 1 1 1 1 1 1
## alcabuse 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## area 1 1 0 0 0 0 0 0 0
## nociga alcabuse bmi area
## sex 0 0 0 0
## age 0 0 0 0
## socprof 0 0 0 0
## income 0 0 0 0
## marital 0 0 0 0
## depress 0 0 0 0
## sport 0 0 0 0
## nofriend 0 0 0 0
## smoke 0 0 0 0
## nociga 0 0 0 0
## alcabuse 1 0 0 0
## bmi 1 1 0 0
## area 0 0 0 0
Synthesizing New Predictor
synth.obj.b <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list,
maxfaclevels = 200, seed = myseed,
proper = TRUE, predictor.matrix = new.pred.mat)
##
## Unexpected values (not obeying the rules) found for variable(s): nociga.
## Rules have been applied but make sure they are correct.
## Synthesis
## -----------
## sex age socprof income marital depress sport nofriend smoke nociga
## alcabuse bmi area
Compare Synthetic and Original Data
The area variable is simulated fairly well from just age and sex. It captures the large and small areas, although the large areas are relatively more variable. This could use some fine-tuning, but I'll stick with it for now.
## sex
## area MALE FEMALE
## area1 3 0
## area100 20 40
## area101 23 33
## area102 11 30
## area104 9 18
## area107 18 37
## area108 6 8
## area109 0 4
## area110 32 41
## area111 12 14
## area112 6 3
## area113 21 18
## area114 0 2
## area115 4 0
## area117 11 33
## area118 0 0
## area12 0 0
## area120 16 21
## area121 16 30
## area124 37 39
## sex
## area MALE FEMALE
## area1 3 1
## area100 26 30
## area101 32 46
## area102 21 39
## area104 13 23
## area107 25 36
## area108 4 6
## area109 1 2
## area110 29 35
## area111 28 26
## area112 2 2
## area113 17 18
## area114 0 2
## area115 1 1
## area117 19 22
## area118 1 0
## area12 0 4
## area120 20 28
## area121 21 27
## area124 31 32
d <- data.frame(difference = as.numeric(tab.syn - tab.orig), sex = c(rep("Male", 154), rep("Female", 154)))
ggplot(d, aes(x = difference, fill = sex)) + geom_histogram() + facet_grid(sex ~ .) +
scale_fill_manual(values = mycols)
The method does a good job of preserving the structure of the areas. How much variability is acceptable is up to the user and the intended purpose. Using more predictors may provide a better fit. The errors are distributed around zero, a good sign that no bias has leaked into the data from the synthesis.
tab.syn <- synth.obj.b$syn %>% dplyr::select(area, sex) %>% table()
tab.orig <- original.df %>% dplyr::select(area, sex) %>% table()
## sex
## marital MALE FEMALE
## SINGLE 653 532
## MARRIED 1403 1606
## WIDOWED 61 505
## DIVORCED 56 153
## LEGALLY SEPARATED 10 0
## DE FACTO SEPARATED 5 13
## sex
## marital MALE FEMALE
## SINGLE 657 596
## MARRIED 1382 1597
## WIDOWED 66 465
## DIVORCED 62 137
## LEGALLY SEPARATED 6 1
## DE FACTO SEPARATED 4 18
At higher levels of aggregation, the structure of the tables is even better maintained.
This material was derived from here: synthpop.